Abstract

Background: The InChI algorithms are written in C and are not available as a Java library. Integration into software written in Java therefore requires a bridge between C and Java libraries, provided by the Java Native Interface (JNI) technology.

Results: We here describe how the InChI library is used in the Bioclipse workbench and the Chemistry Development Kit (CDK) cheminformatics library. To make this possible, a JNI bridge to the InChI library was developed, JNI-InChI, allowing Java software to access the InChI algorithms. By using this bridge, the CDK project packages the InChI binaries in a module and offers easy access from Java using the CDK API. The Bioclipse project packages and offers InChI as a dynamic OSGi bundle that can easily be used by any OSGi-compliant software, in addition to the regular Java Archive and Maven bundles. Bioclipse itself uses the InChI as a key component and calculates it on the fly when visualizing and editing chemical structures. We demonstrate the utility of InChI with various applications in CDK and Bioclipse, such as decision support for chemical liability assessment, tautomer generation, and knowledge aggregation using a linked data approach.

Conclusions: These results show that the InChI library can be used in a variety of Java library dependency solutions, making the functionality easily accessible by Java software, such as in the CDK. The applications show various ways the InChI has been used in Bioclipse to enrich its functionality.

Keywords: InChI, InChIKey, Chemical structures, JNI-InChI, The Chemistry Development Kit, OSGi, Bioclipse, Decision support, Linked data, Tautomers, Databases, Semantic web



Background 

It is of great importance that chemical structures can be serialized in standard formats in order to enable exchange and linking of chemical information. The IUPAC Chemical Identifier (InChI) [1] is such a standardized identifier for chemical structures, and has recently seen wide adoption in the cheminformatics community [2]. A recent special issue details this further [3]. Two important use cases are querying for exact matches in databases, and linking chemical structures using semantic web technologies. The official implementation of InChI is a C library, in order to provide a single implementation that everyone can use. This, however, limits its use in other programming languages such as Java. We here describe the packaging of InChI in Java, to enable frameworks and applications written in this language, like the applications mentioned in this paper, Biojava [4], JOELib [5], and JChem [6], to take advantage of the benefits of InChI. We present the integration of InChI in the cheminformatics library the Chemistry Development Kit as well as the graphical workbench Bioclipse. We also provide demonstrations where InChI is used in decision support for chemical liability assessment, for tautomer generation, and for knowledge aggregation using a linked data approach.

Implementation 

Packaging InChI in Java Archives and Maven bundles

JNI-InChI is the packaging of the InChI libraries in portable Java libraries using the Java Native Interface (JNI), available on Sourceforge under the GNU Lesser General Public License 3.0 (LGPL) [7]. The JNI-InChI library provides native binaries of the InChI library for 32- and 64-bit Windows, Linux and Solaris, 64-bit FreeBSD and 64-bit Intel-based Mac OS X, covering the most common platforms on which the CDK and Bioclipse are run. The library is available as a regular Java Archive (JAR file) and as a Maven bundle from the JNI-InChI project website at http://jni-inchi.sf.net/.
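
How a JNI wrapper might choose among such bundled binaries at run time can be sketched as follows. This is a minimal illustration, not the JNI-InChI implementation: the class name, the `nativeLibraryName` helper and the `inchi-sketch-...` file naming scheme are all hypothetical, and only the list of platforms is taken from the text above.

```java
import java.util.Locale;

/** Illustrative only: maps JVM platform properties to a HYPOTHETICAL native library file name. */
final class NativeLibrarySelector {

    private NativeLibrarySelector() { }

    /** Returns a hypothetical file name, or throws if the platform is not in the list given in the text. */
    static String nativeLibraryName(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        // Simplification: any architecture string containing "64" is treated as 64-bit.
        boolean is64Bit = arch.contains("64");

        if (os.startsWith("windows")) {
            return fileName("windows", is64Bit, ".dll");
        }
        if (os.startsWith("linux")) {
            return fileName("linux", is64Bit, ".so");
        }
        if (os.startsWith("sunos") || os.startsWith("solaris")) {
            return fileName("solaris", is64Bit, ".so");
        }
        // FreeBSD is listed as 64-bit only.
        if (os.startsWith("freebsd") && is64Bit) {
            return fileName("freebsd", true, ".so");
        }
        // Mac OS X is listed as 64-bit Intel only.
        if (os.startsWith("mac") && (arch.equals("x86_64") || arch.equals("amd64"))) {
            return fileName("macosx", true, ".dylib");
        }
        throw new UnsupportedOperationException(
            "No bundled native library for " + osName + " / " + osArch);
    }

    // The naming scheme below is an assumption made for this sketch.
    private static String fileName(String platform, boolean is64Bit, String extension) {
        return "inchi-sketch-" + platform + "-" + (is64Bit ? "64" : "32") + extension;
    }

    public static void main(String[] args) {
        // Fixed sample input; a real loader would read the os.name and os.arch system properties.
        System.out.println(nativeLibraryName("Linux", "amd64"));
    }
}
```

Keeping the selection in a pure function of the two platform strings makes it easy to test without access to each operating system.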

Provisioning of InChI as OSGi bundles

While Maven makes library dependency management a lot easier, it is not the only platform to do so. OSGi [8] is another standard for a dynamic module system in Java, allowing for easy provisioning and interoperability of modules, mainly containing compiled Java code but also associated data. The Bioclipse project has developed OSGi bundles for InChI by wrapping the JNI-InChI libraries, which required some modifications to, for example, class loaders. The OSGi bundles are available from a p2 repository for easy provisioning and integration. Having OSGi bundles with InChI enables easy access from all plugins supporting this module technology. Cheminformatics tools that make use of the OSGi module system include KNIME [9], Cytoscape (as of version 3) [10], Taverna [11,12], and Bioclipse [13]. More information and the bundles can be found at http://www.bioclipse.net/inchi-osgi.

The JNI-InChI API 

The JNI-InChI library is written to call the InChI library directly, rather than accessing it through a command line. To make this possible with JNI, it defines a JniInchiWrapper class with a Java API in which some methods are written in Java, and others call native methods in the matching JniInchiWrapper.c file, which directly calls the C InChI library. This wrapper allows the JNI-InChI user to set up a proper data model for the chemical structure for which the InChI should be calculated, and to set the generation options, allowing users to select, for example, which InChI layers should be generated or whether just a standard InChI should be calculated.
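
To make the notion of layers concrete, the sketch below splits an InChI string into its slash-separated parts using plain string handling. It does not call the InChI library or JNI-InChI. The class and method names are hypothetical, and repeated layers (for example after a fixed-hydrogen or isotopic layer) are simplified to "first occurrence only".

```java
/** Illustrative only: reads the slash-separated layers of an InChI string. */
final class InchiLayers {

    private static final String PREFIX = "InChI=";

    private InchiLayers() { }

    private static String[] parts(String inchi) {
        if (inchi == null || !inchi.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not an InChI string: " + inchi);
        }
        return inchi.substring(PREFIX.length()).split("/");
    }

    /** The version part, e.g. "1S" for a standard InChI of version 1. */
    static String version(String inchi) {
        return parts(inchi)[0];
    }

    /** A standard InChI carries an "S" after the version number. */
    static boolean isStandard(String inchi) {
        return version(inchi).endsWith("S");
    }

    /** The formula layer, which has no lowercase prefix letter; null if absent. */
    static String formula(String inchi) {
        String[] p = parts(inchi);
        if (p.length > 1 && !p[1].isEmpty() && !Character.isLowerCase(p[1].charAt(0))) {
            return p[1];
        }
        return null;
    }

    /** Content of the first layer starting with the given lowercase prefix letter; null if absent. */
    static String firstLayer(String inchi, char prefix) {
        if (!Character.isLowerCase(prefix)) {
            throw new IllegalArgumentException("Layer prefixes are lowercase letters");
        }
        String[] p = parts(inchi);
        for (int i = 1; i < p.length; i++) {
            if (!p[i].isEmpty() && p[i].charAt(0) == prefix) {
                return p[i].substring(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String ethane = "InChI=1S/C2H6/c1-2/h1-2H3";
        System.out.println("formula:      " + formula(ethane));
        System.out.println("connectivity: " + firstLayer(ethane, 'c'));
        System.out.println("hydrogens:    " + firstLayer(ethane, 'h'));
        System.out.println("standard:     " + isStandard(ethane));
    }
}
```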

A subset of the API of the JniInchiWrapper and JniInchiStructure classes is given in Table 1. Using this API we can, for example, calculate the InChI string for ethane (without non-default options; in Java):

    JniInchiInput input = new JniInchiInput("");
    JniInchiAtom a1 = input.addAtom(
        new JniInchiAtom(0.000, 0.000, 0.000, "C")
    );
    a1.setImplicitH(3);
    JniInchiAtom a2 = input.addAtom(
        new JniInchiAtom(0.000, 0.000, 0.000, "C")
    );
    a2.setImplicitH(3);
    input.addBond(
        new JniInchiBond(a1, a2, INCHI_BOND_TYPE.SINGLE)
    );
    JniInchiOutput output =
        JniInchiWrapper.getInchi(input);
    System.out.println(
        "The InChI for ethane is: " +
        output.getInchi()
    );

The full API is available as HTML JavaDoc at http://jni-inchi.sourceforge.net/apidocs/. What the API does not do is support input of chemical structures from chemical file formats, such as the MDL molfile format supported by the InChI library itself. Instead, JNI-InChI encourages cheminformatics libraries to use converters that translate their internal data structure into the JNI-InChI data structure, using the methods of the JniInchiInput class. One library taking this approach is the CDK.



Table 1 Various Java methods from the JniInchiWrapper class

Method                                       Description

JniInchiWrapper
loadLibrary()                                Loads the InChI library suitable for the platform.
getInchi(JniInchiInput)                      Generates an InChI for the given input structure, with the InChI options passed with the input.
getStdInchi(JniInchiInput)                   Generates a Standard InChI for the given input structure.
getStructureFromInchi(JniInchiInputInchi)    Generates a structure from an InChI string (without coordinates).
getInchiKey(String)                          Converts an InChI into an InChIKey.
checkInchi(String, boolean)                  Checks the validity of a (non-standard) InChI, either loosely or strictly.
checkInchiKey(String, boolean)               Checks the validity of a (non-standard) InChIKey, either loosely or strictly.

JniInchiInput
JniInchiInput(List)                          Constructor allowing you to set the InChI generation options as a List of Strings.
addAtom(JniInchiAtom)                        Adds an atom to the input structure.
addBond(JniInchiBond)                        Adds a bond to the input structure.
addStereo0D(JniInchiStereo0D)                Adds a tetrahedral, bond, or allene stereochemistry element to the input structure.
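
As an illustration of what a loose validity check on an InChIKey can look at, the following sketch tests only the outward shape of the key: 14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, and one final uppercase letter. It is not the check performed by checkInchiKey in the InChI library. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Illustrative only: checks the outward shape of an InChIKey, not its validity. */
final class InchiKeyShape {

    // 14 letters, hyphen, 10 letters (8 hash letters, flag, version), hyphen, 1 letter.
    private static final Pattern SHAPE = Pattern.compile("[A-Z]{14}-[A-Z]{10}-[A-Z]");

    private InchiKeyShape() { }

    /** True if the candidate has the 27-character layout of an InChIKey. */
    static boolean looksLikeInchiKey(String candidate) {
        return candidate != null && SHAPE.matcher(candidate).matches();
    }

    /** True if the flag character (ninth letter of the second block) is 'S', marking a standard InChI. */
    static boolean isFromStandardInchi(String key) {
        return looksLikeInchiKey(key) && key.charAt(23) == 'S';
    }

    public static void main(String[] args) {
        // InChIKey shown for carbamazepine in Figure 1.
        String key = "FFGPTBCBLSHEPO-UHFFFAOYSA-N";
        System.out.println(looksLikeInchiKey(key));
        System.out.println(isFromStandardInchi(key));
    }
}
```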
Integration of JNI-InChI into the CDK

The primary purpose of the integration of JNI-InChI into the CDK is to allow the translation of the CDK data structure into that of JNI-InChI. Using this approach, we can convert the content of any chemical file format the CDK supports into InChIs, overcoming limitations of the InChI library in terms of supported file formats.

While JNI-InChI supports the full range of functionality of the InChI C library (structure-to-InChI, InChI-to-structure, AuxInfo-to-structure, InChIKey generation, and InChI and InChIKey validation), not all of this functionality is available in the CDK library in version 1.4.13 and later.

The CDK-to-JNI-InChI bridge supports the following layers: the connectivity layer, tetrahedral and double bond stereochemistry layers, the isotope layer, and the charge layer. Additionally, the CDK API for generating InChIs allows the use of various options, so that standard InChIs and non-standard InChIs can be generated. For example, an InChI with the fixed hydrogen layer can be calculated with the Java code:

    // mierezuur (Dutch for formic acid) is the CDK structure to convert
    InChIGeneratorFactory factory =
        InChIGeneratorFactory.getInstance();
    InChIGenerator generator = factory.getInChIGenerator(
        mierezuur, "FixedH"
    );
    System.out.println(
        generator.getInchi()
    );

The CDK uses this functionality further to generate tautomers, as proposed by Thalheim et al. [14], and demonstrated later in this paper. Another feature is that the InChI library can be used to generate canonical atom numbers, which is done with the InChINumbersTools class.

Integration of InChI in Bioclipse 

Bioclipse is a workbench for the life sciences where cheminformatics is the most developed functionality. Key features of Bioclipse include import, export and editing of chemical structures in various file formats, as well as visualizations and various property calculations, all available from both the graphical workbench and a built-in scripting language (Bioclipse Scripting Language, or BSL) [15,16], and lately via a link to the statistical programming language R [17]. As a Rich Client built on the Eclipse Rich Client Platform (RCP), Bioclipse inherits an extensible architecture implementing the OSGi standard. By adding the previously described InChI OSGi bundles to Bioclipse, Bioclipse exposes InChI calculation as a key feature in the workbench, and the InChI is calculated on all structure modifications and visualized as a general property in the workbench window (see Figure 1). Bioclipse supports the generation of both standard and non-standard InChIs, and a preference allows for selecting between the different versions. An example in BSL is:

    mol = cdk.fromSMILES("OC=O");
    sinchi = inchi.generate(mol);
    inchi = inchi.generate(mol, "FixedH");

Results and discussion 

Additional information on how to install and run the applications below is available at http://www.bioclipse.net/inchi.

Applications of InChI in cheminformatics 

a) Decision support in computational pharmacology 

In chemical safety assessment, the first step when faced with a new chemical structure is to see whether it has already been synthesized, and whether any in vitro assays or in vivo studies have been performed. Given the large size of knowledge bases in companies and organizations, exact database lookups have become ubiquitous tools that are used on a daily basis. Bioclipse Decision Support provides a framework for running exact match queries against a library of chemical structures, which was demonstrated for three open safety endpoints [18]. An example query can be seen in Figure 2.
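
The idea behind such an exact match query can be sketched as a lookup keyed on the InChIKey. This is a minimal illustration, not the Bioclipse Decision Support implementation: the class and its methods are hypothetical, and the single entry is hard-coded sample data mirroring the danthron example of Figure 2.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: an in-memory exact-match index keyed on InChIKey. */
final class ExactMatchIndex {

    private final Map<String, String> outcomeByInchiKey = new HashMap<>();

    /** Registers a compound of the data set under its InChIKey. */
    void add(String inchiKey, String outcome) {
        outcomeByInchiKey.put(inchiKey, outcome);
    }

    /** Returns the recorded outcome, or "no hits" when the structure is not in the data set. */
    String lookup(String inchiKey) {
        return outcomeByInchiKey.getOrDefault(inchiKey, "no hits");
    }

    public static void main(String[] args) {
        ExactMatchIndex ames = new ExactMatchIndex();
        // Sample entry: InChIKey of danthron, reported as positive in Figure 2.
        ames.add("QBPFLULOKWLNNW-UHFFFAOYSA-N", "positive");

        System.out.println(ames.lookup("QBPFLULOKWLNNW-UHFFFAOYSA-N"));
        // Carbamazepine (Figure 1) is not in this toy data set.
        System.out.println(ames.lookup("FFGPTBCBLSHEPO-UHFFFAOYSA-N"));
    }
}
```

Because the InChIKey is a fixed-length string derived from the canonical InChI, an exact structure match reduces to a string equality test, which is what makes a hash-based lookup sufficient here.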

b) Linked data spidering in Bioclipse with Isbjørn

Molecular structures on the internet can be searched using InChI and InChIKeys [21] directly. However, they can also be used as a seed to spider (the process of following links on the World Wide Web) the Linked Data section of the World Wide Web [22]. We developed a plugin to Bioclipse that searches the Internet for information about a molecule, initiated with the InChI and a web service we developed earlier, providing Uniform Resource Identifiers for molecules, available at http://rdf.openmolecules.net/ [23]. This service provides a number of initial links to other Linked Data resources, and links to other resources are followed using owl:sameAs and skos:exactMatch predicates.
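
The link-following step can be sketched as a breadth-first traversal that only crosses owl:sameAs and skos:exactMatch statements. This is an illustration under simplifying assumptions, not the plugin's code: the triples are held in memory instead of being fetched over HTTP, links are followed from subject to object only, and the class name and the example.org URIs in the usage example are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: follows equivalence links through an in-memory list of triples. */
final class SameAsSpider {

    static final String OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs";
    static final String SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch";

    private SameAsSpider() { }

    /**
     * Returns the seed plus every resource reachable from it via the two predicates.
     * Each triple is a String[3] of {subject, predicate, object}.
     */
    static Set<String> equivalents(String seed, List<String[]> triples) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(seed);
        queue.add(seed);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String[] t : triples) {
                boolean followed = OWL_SAME_AS.equals(t[1]) || SKOS_EXACT_MATCH.equals(t[1]);
                // seen.add returns false for already visited resources, which prevents loops.
                if (followed && t[0].equals(current) && seen.add(t[2])) {
                    queue.add(t[2]);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
            new String[] {"http://example.org/a", OWL_SAME_AS, "http://example.org/b"},
            new String[] {"http://example.org/b", SKOS_EXACT_MATCH, "http://example.org/c"},
            new String[] {"http://example.org/c", "http://www.w3.org/2000/01/rdf-schema#seeAlso",
                          "http://example.org/d"}
        );
        for (String uri : equivalents("http://example.org/a", triples)) {
            System.out.println(uri);
        }
    }
}
```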

While spidering the web of molecular information, common ontologies are recognized and used to extract information about the compound. Recognized ontologies include general ontologies like Dublin Core (http://dublincore.org/), RDF Schema [24], SKOS [25], and FOAF [26], as well as domain-specific ontologies, like ChemAxiom [27] and CHEMINF [28], and specific predicates used by specific databases, including Bio2RDF [29], DBPedia [30], and ChemSpider [31] (see Figure 3 left).

By educating Isbjørn about further ontologies we can even, for example, extract drug side effects from the SIDER database [32], as exposed by the Free University Berlin RDF services, as shown in Figure 3 right. The search results of Isbjørn are presented in Bioclipse as an HTML page and opened in a browser window (not shown).

Figure 1 Part of the Bioclipse workbench showing the chemical structure for the drug carbamazepine. The InChI and InChIKey are displayed as properties in the bottom canvas. Editing the chemical structure instantly triggers a recalculation of these properties.

c) CDK tautomer calculation in Bioclipse

The InChI library can also be used to generate tautomers [14]. This method has been implemented in the CDK by Rijnbeek [33], and exposed in the Bioclipse Scripting Language. Tautomers can be calculated for any molecule, for example one created from a SMILES string, as in this example for phenol:

    // no aromatic rings that make it hard to
    // see where the double bonds are
    jcpglobal.setShowAromaticity(false);

    inputSMILES = "c1ccccc1O";
    inputName = "phenol";
    inchi.generate(
        cdk.fromSMILES(inputSMILES)
    )
    tautomers = cdk.getTautomers(
        cdk.fromSMILES(inputSMILES)
    )

    file = "/Virtual/" + inputName + ".sdf";
    cdk.saveSDFile(file, tautomers);
    ui.open(file);

Using this approach we can generate tautomers for any molecule, though it is limited by the heuristic rules implemented by the InChI library. We typically find only a subset of tautomers, rather than the full set. For example, for warfarin it finds only six tautomers out of the 40 reported ones [34].

Conclusions 

The InChI project has chosen to rely on a single implementation for standardizing InChI calculations, and it is important that this code is readily available for all cheminformatics software development. This paper describes the packaging of InChI as a Java library using a JNI bridge (JNI-InChI), which is available as a Java Archive (JAR file) and as Maven bundles. It further shows the integration into the CDK library and how packaging JNI-InChI as OSGi bundles renders InChI easily available for software using this dynamic module system, such as the Bioclipse workbench. The various binary packages make the InChI library easily usable in a variety of Java environments.

Figure 2 Part of the Bioclipse workbench showing the Decision Support feature. It shows three exact matches enabled (right canvas) and the chemical structure of the withdrawn drug danthron. We see that the data sets for CPDB [19] and Ames Mutagenicity [20] both give an exact match, and that this compound has previously been shown to be positive (mutagen) in an Ames Mutagenicity test as well as positive for an in vivo carcinogenicity test included in the Carcinogenicity Potency Database.



A feature of the InChI is that it supports various layers of detail in describing the chemical structure, which has confused end users of cheminformatics software. This led to the selection of a fixed set of layers, resulting in the standard InChI. The CDK supports generation and processing of both standard and non-standard InChIs. Bioclipse provides a preference page where users can indicate which InChI they would like to be calculated by default.



The uses in the CDK and Bioclipse have shown that the InChI is of great utility for uniquely identifying molecular structures in a canonical form, and is therefore well suited for exact matches in database searches, as exemplified in the computational pharmacology example. This also makes it highly suitable for mining the internet and the Linked Data network. We demonstrate this with our Isbjørn plugin for Bioclipse, which aggregates knowledge about chemical compounds from an increasing list of disparate sources. The use of the InChI here shows its potential for the common task of collecting as much information as possible about a novel chemical structure, uniquely identified by the InChI. But the use of the InChI algorithms is not limited to that purpose, and has further benefits. We demonstrate this with the exposure in the CDK and Bioclipse to generate tautomers.

Figure 3 Screenshot of Linked Data spidering results by Isbjørn presented as an HTML page (left panel: dbpedia.org; right panel: www4.wiwiss.fu-berlin.de, including side effects).

Our results show that it is possible to overcome the problem that the InChI algorithm is not implemented in Java, but this comes at a price. Using non-Java code in a Java environment requires a bridge, for which we used JNI, but crossing this bridge is computationally expensive. Furthermore, the integration into the CDK requires bridging two data models: one for the CDK and one for the InChI library. A suite of unit tests is in place to validate that information is correctly translated from the CDK data model into calculated InChIs. However, a full validation using the InChI project test suite has not been completed yet.

Availability and requirements 

• Project Name: JNI-InChI
  Project home page: http://jni-inchi.sourceforge.net/
  Operating system(s): Windows, GNU/Linux, OS/X
  Programming language: C and Java
  Other requirements (if compiling): InChI library
  License: GNU LGPL v3 or later
  Any restrictions to use by non-academics: None additional

• Project Name: The Chemistry Development Kit
  Project home page: http://cdk.sourceforge.net/
  Operating system(s): Platform independent
  Programming language: Java
  Other requirements (for the InChI module): JNI-InChI
  License: GNU LGPL v2.1 or later
  Any restrictions to use by non-academics: None additional

• Project Name: Bioclipse
  Project home page: http://www.bioclipse.net/
  Operating system(s): Windows, GNU/Linux, OS/X
  Programming language: Java
  Other requirements (for InChI functionality): JNI-InChI, The Chemistry Development Kit
  License: Eclipse Public License
  Any restrictions to use by non-academics: None additional